An Association-Based Method for Automatic Indexing with a Controlled Vocabulary
Abstract
In this paper we describe and test a two-stage algorithm, based on a lexical collocation technique, that maps the lexical clues contained in a document representation onto a controlled vocabulary list of subject headings. Using a collection of 4,626 INSPEC documents, we first create a “dictionary” of associations between the lexical items contained in the titles, authors, and abstracts and the controlled vocabulary subject headings assigned to those records by human indexers, with a likelihood ratio statistic as the measure of association. In the deployment stage, we use the dictionary to predict which of the controlled vocabulary subject headings best describe new documents when they are presented to the system. Our evaluation of this algorithm, in which we compare the automatically assigned subject headings to the subject headings assigned to the test documents by human catalogers, shows that we can obtain results comparable to and consistent with human cataloging. In effect, we have cast this as a classic partial match information retrieval problem: we consider the problem to be one of “retrieving” (or assigning) the most probably “relevant” (or correct) controlled vocabulary subject headings to a document based on the clues contained in that document.
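The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state which form of the likelihood ratio statistic is used or how per-term evidence is combined, so the sketch assumes a G² log-likelihood ratio over a 2×2 document co-occurrence table and a simple sum of association weights. The function names (`llr`, `build_dictionary`, `assign_headings`) and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _xlogx(x):
    """Return x * ln(x), with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table (assumed form)."""
    n = k11 + k12 + k21 + k22
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (cells - rows - cols + _xlogx(n)))


def build_dictionary(records):
    """Training stage: records is an iterable of (tokens, headings) pairs
    taken from documents already indexed by humans."""
    n_docs = 0
    token_df, heading_df, pair_df = Counter(), Counter(), Counter()
    for tokens, headings in records:
        n_docs += 1
        toks, heads = set(tokens), set(headings)
        token_df.update(toks)
        heading_df.update(heads)
        pair_df.update((t, h) for t in toks for h in heads)

    assoc = defaultdict(dict)
    for (t, h), k11 in pair_df.items():
        k12 = token_df[t] - k11          # token without heading
        k21 = heading_df[h] - k11        # heading without token
        k22 = n_docs - k11 - k12 - k21   # neither
        # Assumption: keep only pairs that co-occur more often than chance.
        if k11 * n_docs > token_df[t] * heading_df[h]:
            assoc[t][h] = llr(k11, k12, k21, k22)
    return assoc


def assign_headings(assoc, tokens, top_k=5):
    """Deployment stage: rank headings by summed association weight
    (the summation rule is an assumption of this sketch)."""
    scores = Counter()
    for t in set(tokens):
        for h, weight in assoc.get(t, {}).items():
            scores[h] += weight
    return [h for h, _ in scores.most_common(top_k)]


# Toy usage with made-up records; real input would be INSPEC-style records.
toy_records = [
    (["neural", "network", "training"], ["Neural nets"]),
    (["neural", "learning"], ["Neural nets"]),
    (["database", "query"], ["Database systems"]),
    (["query", "index", "database"], ["Database systems"]),
]
toy_dictionary = build_dictionary(toy_records)
suggested = assign_headings(toy_dictionary, ["database", "index"], top_k=1)
```

Treating heading assignment as ranked retrieval, as the abstract suggests, means the output is an ordered list that can be cut off at any depth and compared against the headings chosen by human catalogers.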
Similar Resources
An Approach to Automatic Indexing of Scientific Publications in High Energy Physics for Database SPIRES HEP
We introduce an approach to automatic indexing of e-prints based on a pattern-matching technique that makes extensive use of an Associative Patterns Dictionary (APD), which we developed. Entries in the APD consist of natural language phrases with the same semantic interpretation as a set of keywords from a controlled vocabulary. The method also makes it possible to recognize within e-prints formulae written in TE...
Bibliographic database access using free-text and controlled vocabulary: an evaluation
This paper evaluates and compares the retrieval effectiveness of various search models, based on either automatic text-word indexing or on manually assigned controlled descriptors. Retrieval is from a relatively large collection of bibliographic material written in French. Moreover, for this French collection we evaluate improvements that result from combining automatic and manual indexing. Fir...
Automatic Indexing for Research Papers Using References
An effective way to reveal the contents of research papers is to assign a group of terms from a controlled vocabulary. To the best of our knowledge, a variety of automatic indexing techniques have been studied to enhance effectiveness and efficiency. However, the current approaches depend on the content of a research paper, such as the title, abstract, etc., which suffers from limita...
LOHAI: Providing a Baseline for KOS based Automatic Indexing
Automatic KOS based indexing – i.e. indexing based on a restricted, controlled vocabulary, a thesaurus or a classification – can play an important role in closing the gap between the high-quality, intellectually indexed publications and the mass of unindexed publications. Especially for unknown, heterogeneous publications, like web publications, simple processes that do not rely on manually creat...
Comparing a rule-based versus statistical system for automatic categorization of MEDLINE documents according to biomedical specialty
Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Tw...
Semantically Enhanced Automatic Keyphrase Indexing
The goal of this PhD thesis is to elaborate methods for automatic keyphrase indexing with a controlled vocabulary. Keyphrases are single words or multi-word lexemes that concisely and accurately describe the subject or an aspect of the subject discussed in a document. They are widely used in large document collections such as digital libraries and document repositories. They help organize mater...
Journal title:
- JASIS
Volume 49
Publication date 1998